{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "*Copyright (c) Recommenders contributors.*\n",
                "\n",
                "*Licensed under the MIT License.*"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Riemannian Low-rank Matrix Completion algorithm on Movielens dataset\n",
                "\n",
                "Riemannian Low-rank Matrix Completion (RLRMC) is a matrix factorization based (vanilla) matrix completion algorithm that solves the optimization problem using Riemannian conjugate gradients algorithm (Absil et al., 2008). RLRMC is based on the works by Jawanpuria and Mishra (2018) and Mishra et al. (2013). \n",
                "\n",
                "The ratings matrix of movies (items) and users is modeled as a low-rank matrix. Let the number of movies be $d$ and the number of users be $T$. RLRMC algorithm assumes that the ratings matrix $M$ (of size $d\\times T$) is partially known. The entry at $M(i,j)$ represents the rating given by the $j$-th user to the $i$-th movie. RLRMC learns matrix $M$ as $M=LR^\\top$, where $L$ is a $d\\times r$ matrix and $R$ is a $T\\times r$ matrix. Here, $r$ is the rank hyper-parameter which needs to be provided to the RLRMC algorithm. Typically, it is assumed that $r\\ll d,T$. The optimization problem is solved iteratively using the the Riemannian conjugate gradients algorithm. The Riemannian optimization framework generalizes a range of Euclidean first- and second-order algorithms such as conjugate gradients, trust-regions, among others, to Riemannian manifolds. A detailed exposition of the Riemannian optimization framework can be found in Absil et al. (2008). \n",
                "\n",
                "This notebook provides an example of how to utilize and evaluate RLRMC implementation in **recommenders**."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Pandas version: 0.23.4\n",
                        "System version: 3.7.1 (default, Dec 14 2018, 13:28:58) \n",
                        "[Clang 4.0.1 (tags/RELEASE_401/final)]\n"
                    ]
                }
            ],
            "source": [
                "import sys\n",
                "import time\n",
                "import pandas as pd\n",
                "\n",
                "from recommenders.datasets.python_splitters import python_random_split\n",
                "from recommenders.datasets.python_splitters import python_stratified_split\n",
                "from recommenders.datasets import movielens\n",
                "from recommenders.models.rlrmc.RLRMCdataset import RLRMCdataset \n",
                "from recommenders.models.rlrmc.RLRMCalgorithm import RLRMCalgorithm \n",
                "from recommenders.evaluation.python_evaluation import rmse, mae\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(f\"Pandas version: {pd.__version__}\")\n",
                "print(f\"System version: {sys.version}\")\n",
                "\n",
                "%load_ext autoreload\n",
                "%autoreload 2"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Set the default parameters.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
                "MOVIELENS_DATA_SIZE = '10m'\n",
                "\n",
                "# Model parameters\n",
                "\n",
                "# rank of the model, a positive integer (usually small), required parameter\n",
                "rank_parameter = 10\n",
                "# regularization parameter multiplied to loss function, a positive number (usually small), required parameter\n",
                "regularization_parameter = 0.001\n",
                "# initialization option for the model, 'svd' employs singular value decomposition, optional parameter\n",
                "initialization_flag = 'svd' #default is 'random'\n",
                "# maximum number of iterations for the solver, a positive integer, optional parameter\n",
                "maximum_iteration = 100 #optional, default is 100\n",
                "# maximum time in seconds for the solver, a positive integer, optional parameter\n",
                "maximum_time = 300#optional, default is 1000\n",
                "\n",
                "# Verbosity of the intermediate results\n",
                "verbosity=0 #optional parameter, valid values are 0,1,2, default is 0\n",
                "# Whether to compute per iteration train RMSE (and test RMSE, if test data is given)\n",
                "compute_iter_rmse=True #optional parameter, boolean value, default is False"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [],
            "source": [
                "## Logging utilities. Please import 'logging' in order to use the following command. \n",
                "# logging.basicConfig(level=logging.INFO)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 1. Download the MovieLens dataset\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "65.6MB [00:25, 2.57MB/s]                            \n"
                    ]
                }
            ],
            "source": [
                "\n",
                "df = movielens.load_pandas_df(\n",
                "    size=MOVIELENS_DATA_SIZE,\n",
                "    header=[\"userID\", \"itemID\", \"rating\", \"timestamp\"]\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 2. Split the data using the Spark chronological splitter provided in utilities"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "## If both validation and test sets are required\n",
                "# train, validation, test = python_random_split(df,[0.6, 0.2, 0.2])\n",
                "\n",
                "## If validation set is not required\n",
                "train, test = python_random_split(df,[0.8, 0.2])\n",
                "\n",
                "## If test set is not required\n",
                "# train, validation = python_random_split(df,[0.8, 0.2])\n",
                "\n",
                "## If both validation and test sets are not required (i.e., the complete dataset is for training the model)\n",
                "# train = df"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Generate an RLRMCdataset object from the data subsets."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [],
            "source": [
                "# data = RLRMCdataset(train=train, validation=validation, test=test)\n",
                "data = RLRMCdataset(train=train, test=test) # No validation set\n",
                "# data = RLRMCdataset(train=train, validation=validation) # No test set\n",
                "# data = RLRMCdataset(train=train) # No validation or test set"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 3. Train the RLRMC model on the training data"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [],
            "source": [
                "model = RLRMCalgorithm(rank = rank_parameter,\n",
                "                       C = regularization_parameter,\n",
                "                       model_param = data.model_param,\n",
                "                       initialize_flag = initialization_flag,\n",
                "                       maxiter=maximum_iteration,\n",
                "                       max_time=maximum_time)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Took 44.991251945495605 seconds for training.\n"
                    ]
                }
            ],
            "source": [
                "start_time = time.time()\n",
                "\n",
                "model.fit(data,verbosity=verbosity)\n",
                "\n",
                "# fit_and_evaluate will compute RMSE on the validation set (if given) at every iteration\n",
                "# model.fit_and_evaluate(data,verbosity=verbosity)\n",
                "\n",
                "train_time = time.time() - start_time # train_time includes both model initialization and model training time. \n",
                "\n",
                "print(\"Took {} seconds for training.\".format(train_time))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 4. Obtain predictions from the RLRMC model on the test data"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [],
            "source": [
                "## Obtain predictions on (userID,itemID) pairs (60586,54775) and (52681,36519) in Movielens 10m dataset\n",
                "# output = model.predict([60586,52681],[54775,36519]) # Movielens 10m dataset\n",
                "\n",
                "# Obtain prediction on the full test set\n",
                "predictions_ndarr = model.predict(test['userID'].values,test['itemID'].values)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 5. Evaluate how well RLRMC performs"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "RMSE:\t0.809386\n",
                        "MAE:\t0.620971\n"
                    ]
                }
            ],
            "source": [
                "predictions_df = pd.DataFrame(data={\"userID\": test['userID'].values, \"itemID\":test['itemID'].values, \"prediction\":predictions_ndarr})\n",
                "\n",
                "## Compute test RMSE \n",
                "eval_rmse = rmse(test, predictions_df)\n",
                "## Compute test MAE \n",
                "eval_mae = mae(test, predictions_df)\n",
                "\n",
                "print(\"RMSE:\\t%f\" % eval_rmse,\n",
                "      \"MAE:\\t%f\" % eval_mae, sep='\\n')"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Reference\n",
                "[1] Pratik Jawanpuria and Bamdev Mishra. *A unified framework for structured low-rank matrix learning*. In International Conference on Machine Learning, 2018.\n",
                "\n",
                "[2] Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre. *Low-rank optimization with trace norm penalty*. In SIAM Journal on Optimization 23(4):2124-2149, 2013.\n",
                "\n",
                "[3] James Townsend, Niklas Koep, and Sebastian Weichwald. *Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation*. In Journal of Machine Learning Research 17(137):1-5, 2016.\n",
                "\n",
                "[4] P.-A. Absil, R. Mahony, and R. Sepulchre. *Optimization Algorithms on Matrix Manifolds*. Princeton University Press, Princeton, NJ, 2008.\n",
                "\n",
                "[5] A. Edelman, T. Arias, and S. Smith. *The geometry of algo- rithms with orthogonality constraints*. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998."
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.7.1"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
